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Abstract 

Estimation methods for mutation rates (or probabilities) in Luria-Delbriick fluctuation analysis usually assume that the final 
number of cells remains constant from one culture to another. We show that this leads to systematically underestimate the 
mutation rate. Two levels of information on final numbers are considered: either the coefficient of variation has been 
independently estimated, or the final number of cells in each culture is known. In both cases, unbiased estimation methods 
are proposed. Their statistical properties are assessed both theoretically and through Monte-Carlo simulation. As an 
application, the data from two well known fluctuation analysis studies on Mycobacterium tuberculosis are reexamined. 
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Introduction 

Since the pioneering work of Luria and Delbriick [1], 
fluctuation analysis has been the object of many studies: see [2- 
7] for reviews. In the past twenty years, the stress has been put on 
the estimation of the expected number of mutations, for which 
reliable methods are now available [8-15]. However, as Stewart 
puts it (p. 1140 of [4]): 

The parameter A [expected number of mutations] is not, in 
itself, of biological interest because the experimenter can 
vary it at will simply by changing the size of the culture 
vessel or the richness of the medium. What he really wants 
to know is not A, but the mutation rate. 

Deriving a mutation rate (i.e. the probability for a mutation to 
occur upon any given cell division) from an expected number of 
mutations seems easy: the former is the quotient of the latter by the 
final number of cells at the end of the experiment. The problem is 
the definition given to "final number of cells". The simplest view is 
expressed by Kendal and Frost (p. 1062 of [2]). 

N is obtained by averaging the final number of cells from 
each parallel culture. 

Other authors have developed a more cautious approach, like 
Foster (p. 198 of [5]). 



The validity of the mutation rate calculation requires that N, 
be the same in each culture. Usually, but not always, this can 
be accomplished by growing cells to saturation. If achieving 
an uniform Nt is a problem, the cell number in each culture 
can be monitored before mutant selection by measuring the 
optical density or by counting cells microscopically (e.g. 
using a Petroff-Hausser chamber). Because there is currently 
no valid method to correct for different N t 's, deviant 
cultures must be eliminated from the analysis. 

Even under the most careful monitoring, final numbers of cells 
vary [16]. Yet, final number data are rarely reported in fluctuation 
analysis experiments, although exceptions exist such as [17,18]. 
Theoretical models considering variations in the population size 
have previously been proposed by Angerer [19] and Komarova et 
al. [20]. Yet, to the best of our knowledge, Foster's assertion that 
"there is currently no valid method to correct for different iV/s" 
remains true to this date. This paper proposes several such 
methods. 

As we shall see, dividing an estimated expected number of 
mutations by a mean final number of cells, induces a negative bias 
on mutation rates. Not only the mutation rate, but also the 
variance of the estimator are underestimated, thus potentially 
inducing wrong conclusions in statistical testing. Two levels of 
knowledge on the fluctuations of final numbers are considered. 
Either the mean and variance of final numbers have been 
estimated separately, or the final number is known for each 
culture. In the first case, if n denotes the estimate of the mutation 
rate assuming constant final numbers, the unbiased estimate 7t u b is 
obtained by: 
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+ (1) 

where /( and C denote the mean and coefficient of variation of the 
final number of cells. When final numbers are known for all 
cultures, better results are obtained by the Maximum Likelihood 
method. The qualities of the proposed estimators have been 
assessed on a simulation study. The impact on real experiments is 
discussed, using Mycobacterium tuberculosis data published by David 
[17], and Werngren & Hoffher [21]. Our R [22] implementation 
of the simulation function and the different estimators is provided 
in File SI. 

Results 

Simulation experiments 

Six different estimates of n were computed on 1 000 simulated 
samples of 50 couples mutant counts - final numbers. Our choice 
for the sample size was motivated by two opposite reasons. On the 
one hand, sample sizes in practice rarely exceed a few tens. On the 
other hand, confidence interval calculations are all based on 
asymptotic normality, which requires the sample size to be large 
enough. A sample size of 50 seemed a reasonable compromise. 
Boxplots for the estimates are represented on Figure 1. The first 
boxplot corresponds to the 1000 estimates by the po-method, 
assuming the mean final number is known; it is negatively biased 
as predicted by the theory. The next boxplot represents estimates 
from the Maximum Likelihood method with known mean final 
number; it is coherent with the previous one, and similarly biased 
as expected. On the next two boxplots, the estimates have been 
multiplied by the unbiasing factor (1). The unbiasing is correct for 
both methods. For the last two boxplots, each estimate has been 
computed using the 50 couples with no prior knowledge on the 
mean and coefficient of variation of final numbers. The best results 
are obtained by the maximum likelihood method (last boxplot). 



The />o-method (label MLP0) performs nearly as well. Since the 
last two boxplots do not use any prior information, one could have 
expected their dispersions to be higher than those of the first four. 
This was not the case, which proves that prior knowledge on the 
distribution of N is not a real improvement over measuring final 
numbers for each culture. 

Each estimation method returns a (theoretical) standard 
deviation, from which confidence intervals can be computed. It 
is is based on a large sample approximation. The sample size in 
current fluctuation analysis experiments usually ranges from 20 to 
50. Since the estimated standard deviation is of high importance 
for statistical decision, it was necessary to check whether 
theoretical standard deviations matched observations. On the 
same samples, the empirical standard deviation of the 1000 
estimates was computed, and compared to the mean value of 
theoretical standard deviations. For each of the estimators, the 
theoretical standard deviation was smaller than the observed one; 
yet, the relative error was smaller than 5%, which validates the 
theoretical value. For instance, the empirical standard deviation 
for the maximum likelihood estimate (rightmost boxplot of 
Figure 1) was 1.85 x 10~ 10 , whereas the theoretical value was 
1.80x 10- 10 . 

Published data sets 

In the two references studied here [17,21], the authors used 
Luria & Delbriick's method of the mean. Luria & Delbrtick [1] 
themselves had remarked that the method is very sensitive to the 
size of jackpots and induces important biases; see also Lea & 
Coulson [23], and Pope et al. [6] for a more recent reference. 

Table 1 reports mutation rate estimates for the data in Table 1 
of David [17]. Since detailed data were not avaible, only the po- 
method could be used. The second column contains the author's 
estimates. The next two columns contain the unbiased po-estimate 
and its 95% confidence interval. Observe that, even though 
confidence intervals are large due to the small sample sizes, the 
author's estimates are outside the confidence interval in 5 cases out 
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Figure 1. Estimates of a mutation rate on 1000 samples of size 50 of pairs mutant counts - final counts. The horizontal line marks the 
true value. The first two boxplots correspond the traditional po- and ML methods, which estimate the expected number of mutations from the sample 
of mutant counts, then divide by the final number of cells, supposed as known. On the next two boxplots, the estimates have been multiplied by the 
unbiasing factor (1). The last two boxplots use the full samples of pairs but no prior knowledge on final numbers. The best results are obtained by the 
maximum likelihood method (last boxplot). The ^-method (label MLPO) performs nearly as well. 
doi:10.1371/journal.pone.0101434.g001 



PLOS ONE | www.plosone.org 



2 



July 2014 | Volume 9 | Issue 7 | e101434 



Unbiased Estimation of Mutation Rates under Fluctuating Final Counts 



Table 1. Mutation rate estimates from Table 1 of [17]. 



Determination 


Author 


Po-method 


Confidence interval 


Isoniazid 1 


1.84 x 10" 8 


2.2 xlO" 8 


[5.8 x 10"'; 3.8 x 10" 8 ] 


Isoniazid 2 


3.5x 1(T 8 


1.1 x 10~ 8 


[5.3x10-'; 1.7 x 10- 8 ] 


Isoniazid 3 


1.7 x 10" 8 


1.3 xlO -8 


[5.1 x 10"'; 2.1 x 10" 8 ] 


Isoniazid 4 


3.2x 10- 8 


8.6x 10- 9 


[4.7x10-'; 1.2xl0" 8 ] 


Streptomycin 1 


0.9 x 10- 8 


5.2 xlO" 9 


[1.9x 10"'; 8.5 x 10-'] 


Streptomycin 2 


5. Ox 10- 8 


6.6 x 10- 9 


[3.9 x 10-'; 9.2 x 10- 8 ] 


Rifampin 1 


1.8 x 10"'° 


3.2x10-'° 


[0.0x10-'°; 9.4x10-'°] 


Rifampin 2 


2.7x 10" 10 


2.9x10-'° 


[0.0x10-'°; 5.8x10-'°] 


Ethambutol 1 


0.7 x 10" 7 


3.3 x 10~ 9 


[9.1x10-'°; 5.6x10-'] 


Ethambutol 2 


1.3xlO- 7 


3.8 x 10- 9 


[2.3 x 10-'; 5.3 x 10-'] 



The author's estimates were calculated by Luria and Delbruck's method of the mean. Our estimates were calculated by the /?o-method. The bias correction (1} was 
applied, with a coefficient of variation C = 0.35 on final numbers. The 95% confidence interval is given in the last column. 
doi:l 0.1 371 /joumal.pone.01 01 434.T.001 



of 10. The most important discrepancies are due the author's use 
of a strongly biased estimation method: when large jackpots 
appear in the mutant counts, as in the Ethambutol cases (last two 
lines of Table 1), the method of the mean may overestimate n by 
several orders of magnitude. The main conclusion of [17] was a 
significant difference in mutation rates, depending on the drug 
(Isoniazid, Streptomycin, Rifampin, or Ethambutol). Indeed that 
difference is confirmed by an ANOVA of the estimated mutation 
rates (P = 0.012). 

Table 2 of David [1 7] contains two paired samples of mutant 
counts and final numbers. All possible estimates were computed. 
Values ranged between 1.81 x 10~ 10 and 2.14 x 10~ 10 . The two 
values that we consider most reliable, obtained by the maximum 
likelihood method, were very similar: 1.98x10- 10 and 
1.97 x 10~ 10 . The estimate reported by the author is 
7.53 x 10~ 10 . Again, the difference is due to the bias induced by 
the author's estimation method. 

Table 2 reports mutation rate estimates by the ML method, 
from data in Table 1 of Werngren & Hoffner [2 1] . The second 
column contains the authors' estimates, calculated by Luria & 
Delbriick method of the mean. The next two columns contain the 
unbiased ML estimate and its 95% confidence interval. Except for 
two strains, the authors' estimate is outside the confidence interval. 
Here, the method of the mean used by the authors has 
underestimated the mutation rate, because of the very small 
number of jackpots in the data. The main conclusion of [21] was 
that no significant difference had been observed between non- 
Beijing strains (first seven lines) and Beijing strains (last six lines). 
Actually, the average mutation rate over the first seven lines is 
4.37 x 10" 8 , over the last six lines it is 2.69 xl0~ 8 . The 
difference is significant at threshold 5% (Welsh Two Sample li- 
test, P= 0.047). 

Discussion 

In any estimation problem, three levels must be distinguished: 
the reality which is and will remain unknown, the mathematical 
model which involves more or less realistic hypotheses, and the 
estimation method. Minimal requirements for an estimator are 
consistence (outputs should be close to the unknown value of the 
parameter), and a computable asymptotic variance (to allow 
statistical inference). Since there is no way to validate all 



mathematical hypotheses that define the model, another quality 
is desirable: robustness. Indeed, designing an estimator for a given 
model and applying it to a different one usually induces a bias: the 
smaller the bias, the more robust the estimator. For mutation rate 
estimates, several sources of bias have been identified, such as cell 
deaths [19,24-26], unknown division time distribution [15], etc. 
Since there is no way to double check estimates on real data, the 
usual approach for evaluating an estimation method consists in 
repeating in silico experiments, i.e. simulate mathematical models 
for a given value of the parameter, estimate that value repeatedly, 
and study the distribution of the obtained estimates. A general 
simulation algorithm described in [15] permits extensive Monte- 
Carlo experiments. 

Usually, only the expected number of mutations is considered as 
the parameter of interest. Among the many estimation procedures 
that have been proposed, we have focused on the />o-method and 
the maximum likelihood (ML); they satisfy the basic requirements 
of statistical inference. As for most other parametric estimation 
problems, the ML method is the most precise. Provided cell deaths 
are neglected, the />o-method stands out as the most robust. 

All estimation methods are valid only if all observed mutant 
counts come from the same Luria-Delbruck distribution, i.e. if they 
have been obtained under a fixed expected number of mutations. 
However, the parameter of real interest which must be considered 
as fixed, is the mutation rate. For each culture the expected 
number of mutations is the product of the mutation rate by the 
final number of cells. Since final numbers vary from one culture to 
another, so do expected numbers of mutations. As shown here, 
applying the po- and ML procedures to the fluctuating final 
number case as if final numbers were constant, induces a bias. 
Two solutions have been proposed. In the case where the final 
numbers of each culture are unknown, but a coefficient of 
variation is available, an unbiasing factor has been defined, and 
validated on simulation experiments. The unbiasing factor (1) 
measures the error induced by neglecting final number fluctua- 
tions: the relative error is of order aC 2 /2 where a = 7t/i is the 
expected number of mutations and C the coefficient of variation of 
final numbers. 

The more favorable case is when final numbers are available. 
Of course measuring the final number of cells for each culture 
leads to reducing the volume of the culture in which the mutants 
are counted, and therefore underestimating mutations. This 
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Table 2. Mutation rate estimates from Table 1 of [21]. 



Strain 


Authors 


ML method 


Confidence interval 


H37Rv 


8.6 x 10"' 


4.8 x 10- 8 


[3.0 x 10- 8 ; 6.6 x 10" 8 ] 


E 865/94 


2.4x 1(T 8 


7.6x 10~ 8 


[4.3 x 10~ 8 ; 1.1 x 10~ 7 ] 


E 729/94 


9.6 x 10" 9 


2.3 x 10~ 8 


[1.3x 10" 8 ; 3.3x 10" 8 ] 


E 740/94 


1.1 x 10- 8 


3.6x 10- 8 


[2.2 x 10- 8 ; 5.0x 10- 8 ] 


E 1221/94 


6.5 x 10" 9 


UxlO" 8 


[7.3 x 10" 9 ; 1.9x 10- 8 ] 


E 1449/94 


1.5 x 10- 8 


4.8 x 10- 8 


[2.9 x 10- 8 ; 6.8 x 10- 8 ] 


Harlingen 


1.4 x 10" 8 


6.2 xlO -8 


[3.8 x 10~ 8 ; 8.6 x 10" 8 ] 


E 26/95 


1.3xl(T 8 


2.3 x 10~ 8 


[1.3x 10~ 8 ; 3.4x 10~ 8 ] 


E 80/95 


7.9 x 10" 9 


2.8 x 10~ 8 


[1.6x 10~ 8 ; 4.0x 10- 8 ] 


E 55 94 


l.Ox ltr 8 


2.0 x 10- 8 


[9.7 x 10-'; 3.1 x 10- 8 ] 


E 26/94 


9.4 x 10~ 9 


3.2 xlO" 8 


[2.2 x 10~ 8 ; 4.3 x 10" 8 ] 


E 3942/94 


i.5x ltr 8 


3.9x 10~ 8 


[2.2 x 10~ 8 ; 5.6 x 10- 8 ] 


E 47/94 


1.2 x 10" 8 


1.8 xlO -8 


[9.2 x 10" 9 ; 2.8x 10" 8 ] 



The authors' estimates were calculated by Luria and Delbruck's method of the mean. Our estimates were calculated by the maximum likelihood method under 
exponential division times. The bias correction (1 ) was applied, using a coefficient of variation C = 0.44 on final numbers. The 95% confidence interval is given in the last 
column. 

doi:1 0.1 371 /journal.pone.01 01 434.t002 



should be accounted for, by proportionally adjusting the estimates 
of final numbers. When coupled mutant counts - final numbers 
have been collected, variants of the po- and ML methods are 
available. Both yield quite precise estimates. As in the constant 
final number case, the po-method is more robust, and almost as 
precise as the ML method. Only the ML method can output 
relative fitness estimates. 

Does the correction for fluctuating final numbers have an 
impact on the interpretation of the data? We have reexamined the 
data in two examples chosen from the literature. In both cases, 
important discrepancies were oberved, that do not only come from 
neglecting final numbers: they are essentially due to the author's 
use of Luria-Delbruck's method of the mean, which is very 
sensitive to jackpots, and can bias the mutation rate estimate by 
several orders of magnitude. In David's paper, the ethambutol 
mutation rate had been estimated around 10~ 7 whereas our 
estimation is of order 10~ 9 . The demonstration is even more 
striking in Werngren and Hoffner's paper. They compared 
mutation rate between Beijing and non Beijing M. tuberculosis 
strains and concluded that it was not different and thus could not 
explain the strong association between Beijing strains and 
multidrug resistance phenotype. However we re-calcutated the 
mutation rate and showed that it was significantly higher for 
Beijing vs. non-Beijing strains. This result is consistent with a 
recent paper [27] showing that lineage 2 (Beijing) M, tuberculosis 
strains have a higher mutation rate than lineage 4 (non-Beijing) 
strains. Given the importance of mutation rates on the risk of 
selection of drug resistant mutants, an accurate evaluation is very 
important. We hope that our results will help improving precision 
in the evaluation of mutation rates. 

Conclusion 

Dealing with classical estimation methods, Foster [5] was right 
in recommending that cultures with deviant final numbers be 
eliminated from fluctuation analysis. Indeed, under varying final 
numbers those methods underestimate mutation rates, and the 
relative bias is proportional to the squared coefficient of variation 



of final numbers. Yet, instead of being discarded as a nuisance, 
variations in final numbers should be added to the available 
information to improve estimation: the best mutation rates 
estimates are obtained when couples mutation count - final 
number are used. 

Two possibilities exist. If mutant counts contain enough zeros 
(say 10% or more), the po-meihod gives reliable results in virtually 
null computer time, and is robust both to relative fitness and 
division time distribution changes. If mutant counts do not contain 
enough zeros, or if an estimate of relative fitness is sought for, then 
the joint estimation of the mutation rate and relative fitness should 
be carried through by the maximum likelihood method. 

We are currently working on an optimized implementation of 
these methods into a forthcoming R [22] package that will be 
made freely available. 

Methods 

Here, /V denotes the final number of cells in a Luria-Delbriick 
fluctuation analysis experiment. Contrarily to the traditional point 
of view [5], fluctuations on /V are considered, i.e. N is viewed as a 
random variable. In the following subsections, different levels of 
information are assumed on the distribution of N: either its 
Laplace transform is known, or only its expectation and variance 
are known, or nothing is known, but the final numbers of cells 
have been measured together with mutant counts for each 
experiment. Notations for the different parameters are summa- 
rized in Table 3. 

As usual, adding a 'hat' to the notation of a parameter denotes 
an estimator of that parameter. We shall consider only strongly 
consistent, asymptotically Gaussian estimators. If 9 is any 
parameter, and s denotes the sample size, then ifs(0 — &) 
converges to a centered Gaussian distribution as s tends to 
infinity. The variance of that distribution, called asymptotic 
variance of 9, will be denoted by Vg. 

In the next four subsections, the focus is on the so-called p$- 
method, introduced by Luria and Delbriick [1] (see also [5,28]). 
The problem of jointly estimating the mutation rate n and the 
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relative fitness p by he maximum likelihood method will be treated 
after. 

Unbiasing /^-estimates 

The final number of cells N is viewed as a random variable with 
probability distribution function G on [0, + co). The distribution of 
N is supposed to be known and its Laplace transform is denoted by 
C. 



£0i)=E[e~ 



'dG(t) 



The />o-method consists of estimating the mean number of 
mutations a by the negative logarithm oi po, then divide by p to 
obtain an estimate of n. 



ot 0 = - log (p 0 ) and 7t 0 = 

Actually, ocq is a consistent estimator of: 

- log(p 0 )= - log(£(7r)) 



0(1 

/< 



The expectation and variance of N are denoted by p and a 2 
respectively. Let U be a random variable, with uniform 
distribution on [0,1], independent from N. The indicator X for 
the mutant count being null is defined as: 

X = I u<e -nN , 

where £4 denotes the indicator of event A (1 if A is true, 0 else). 
Therefore: 

]p[X=l\N = t]=e- n ' , 

and 

p 0 =]p[X=\}=C(n). 



If N is constant, then £(71) = e~ II/ ' = e~", and — log(po) = x: in 
that case So is asymptotically unbiased. If N is not constant, 
because of the convexity of the exponential, and by Jensen's 
inequality, — log(£(7i)) is smaller than a, i.e. So underestimates a, 
and therefore fto underestimates n. 

Denote by C~ the inverse of £ (assumed to be injective). 
Define a new estimator of n by: 

A ub = £- 1 (e-" ft o)=£- 1 (po). (2) 

By construction, n u \, is a strongly consistent estimator of 71, and 
therefore it is asymptotically unbiased. Its asymptotic variance is 
obtained by the traditional delta-method (see e.g. [29]): 
\A0*ub — n ) converges in distribution to the univariate centered 
Gaussian distribution with variance: 



Consider a sample of size s, i.e. s independent copies of X: 
(X\, . . . ,X S ). Denote by po the empirical mean of the Xj's, i.e. the 
relative frequency of zeros among mutant counts. 



P0 = 



By the central limit theorem, \/s(po— pa) converges in 
distribution to the centered Gaussian distribution with variance 
po( \ —po), i-e. po has asymptotic variance Vp 0 =/>o(l — po)- 



--(C'(n))- 2 p 0 (\-p 0 ) 



As expected, if N is constant at p, then p 0 = C(n) = e 
7C u b = tiq, and 

1 ~Po 
ub u P z Po 



This formula is not new: the asymptotic variance of fto appeared 
as formula 35, p. 276 of Lea & Coulson [23]; see also [5,28]. 



Table 3. Parameters and notations for the mathematical model. 





known parameters 


N 


random final number of cells 


C(n) = -E[e-' N \ 


Laplace transform of TV 




expectation of jV 


<r= v / Var[JV] 


standard-deviation of Af 


C = a/ix 


coefficient of variation of N 


unknown parameters 


71 


mutation rate 


a = -Rj.1 


expected number of mutations 


Po=e~* 


probability of zero mutant 


P 


relative fitness of normal cells compared to mutants 


Notations for known and unknown parameters: N denotes a generic random final number of cells. 
doi:1 0.1 371 /journal.pone.01 01 434.t003 
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Families of distributions for which explicit expressions of 7t u b 
and Vft ub can be obtained are scarce. Two examples are given 
below. 

Gamma distributions. They depend on two parameters, 
usually denoted by a and X. The expectation and variance are: 

a , a 

H=- and a = ^ . 

The squared coefficient of variation is the inverse of the shape 
parameter: C 2 = 1 /a. The Laplace transform at n is: 

*»-(£)'■ 

One gets: 

Kb = MPo"~ l ) and f ftub =^ 3 v fto . 



As we shall see in the next subsection, the last two expressions, 
which are exact for inverse Gaussian distributions, hold as a first 
order approximation for any distribution. 

First order approximation 

If the probability distribution of N is known, the bias can be 
exactly corrected by inverting the Laplace transform of N. 
However, this is only a theoretical viewpoint. The best that can be 
hoped for in practice is an estimate of the expectation of N 
together with its variance. It turns out that whatever the 
distribution of N, and provided the product of the coefficient of 
variation by the expected number of mutations remains relatively 
small, the bias can be corrected. Here, we only assume that the 
first two moments of N, \i and a 2 are known, but the full 
distribution of N, and in particular its Laplace transform, remains 
unknown. As we have seen, the expectation of So is — log (C{n)). 
Consider the terms of the series expansion of C{n) in n up to order 
2 (see e.g. [30]): 

C{n) = \-nn+ ¥ ^n 2 +--- 



Expressed in terms of ot 0 , ft and C: Taking negative logarithm, 

1 , 7 s 7 lOg (A")) 

*ub=^2 (exp(a 0 C )-l) and v kub = exp(«C )v ft() . 



= n-— nf + 
H 2/i 



Inverse Gaussian distributions. They depend on two 
parameters, X and ji. The parameter fi is the expectation, and 
the variance is a 2 = fi 3 /X. The squared coefficient of variation is 
C 2 = jx/X. The Laplace transform at n is: 



One gets: 



log(p 0 ) , lo g 2 (Po) 

tub = V - 



H 2X 



and 



v ftub = (l-^log(po)) 2 ^ = (l-flo g (,„)) 2 v ft 



Expressed in terms of ao and C 2 , these expressions become: 



^ub = ^0 1 + 



a 0 C 7 



and 



vti ub = {l+xC 2 ) 2 v jl 



"0 



Expressed in terms of a and C 2 , the relative bias is: 



a 2 aC 2 

1 — 1 , 

2/i 2 



To unbias no, one must divide by the relative bias or (as a first 

6c C 2 

order approximation), multiply by 1 H — . Hence (1): 



ftub = *o 1 + 



The asymptotic variance, obtained through the delta-method is: 

(3) 



v ftub = (l +0! C 2 ) 2 v fto . 



These expressions are exact for inverse Gaussian distributions, 
only approximations for any other distribution. 

To assess the validity range of the unbiasing factor, a simulation 
experiment was conducted. For the same value of 7T=10~ 9 , 
samples of final numbers were simulated with a log-normal 
distribution with mean fi = a/n and coefficient of variation C. The 
values of a ranged from 0.1 to 2, those of C from 0 to 1. The 
results are shown on Figure 2. Red curves show the actual relative 
bias of the po- m ethod; for blue curves, the bias has been corrected 
by the unbiasing factor (1). The correction maintains the bias 
under acceptable values even for relatively large a. and C. 
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The />o-method by maximum likelihood 

In this section, nothing is assumed about the distribution of N. 
A couple (X ,N) of random variables is considered, where X 
represents the indicator of a null mutant count, and N the total 
number of cells at the end of the experiment. The conditional 
distribution of X knowing N = n, is defined as before: 

p[X=l|A r = «]=e-™ . 



Assume that s experiments have been repeated independently, 
yielding s couples (Xi,ni), where Xj is 1 or 0 according to whether 
zero or a positive number of mutants have been counted, and n,- is 
the final number of cells. The likelihood is the probability of the 
observation: 



II(e 

1=1 



The likelihood depends only on the products 7W,. If all «, s are 
divided by a given constant, then the maximum likelihood 
estimator will be multiplied by the same constant. Since the «,'s 
are very large and n very small, rescaling both can make the 
calculation numerically more stable. 

The log-likelihood and its derivatives are: 



d£ 
dn 



(7t) 



dn 2 



(71) 



Y, - ntiiXi + (1 - Xi) log( 1 - e 

* a-xMi 
E -"-*■+ pm/ j > 

i=i e ' — f 



jointly estimated, as the pair (a,p) in the constant final number 
case. 

Here is the mathematical model: for each experiment a pair of 
numbers giving the number of mutants and the final number of 
cells is obtained. An experiment is modelled by a couple (M,N) of 
random variables, where M represents the number of mutants and 
N the total number of cells at the end of the experiment. The 
conditional distribution of M knowing N = n is assumed to be the 
generalized Luria-Delbruck distribution GLD(nn,p,F). The no- 
tation is that of [15]: the expected number of mutations a is the 
product of 7i by the expected final number of cells, the relative 
fitness (ratio of the growth rate of the population of normal cells 
divided by that of mutants) is p, and the distribution of mutant 
division times is given by F. As in [15], we assume that a model 
has been chosen for the distribution of division times, so that only 
the mutation probability n and the relative fitness p are to be 
estimated. 

The sample size being s, for i = 1 , . . . ,s experiment number i 
has yielded a couple (»?,•,«,•), where m, is the mutant count and «, 
is the final number of cells. As in [14,15], we denote by q m (a,p) the 
probability of a mutant count equal to m, under the Luria- 
Delbriick distribution with parameters a (expected number of 
mutations) and p (relative fitness). The computation algorithms of 
the q m (a,p) are well known and will not be reproduced here: see 
[12,14,15]. With that notation, the mutant count at the end of the 
i-th experiment is equal to m, with probability q m . : (mij,p). No 
assumption being made on the final counts, we consider the s- 
tuple of mutant counts (wi), =1 s as a realization of a sample of 
independent random variables. 

The log-likelihood is: 



t(n,p)= J2log(q mj (nn h p)) 



(4) 



i=i 



The maximum likelihood estimator 7t m i is the solution of 

dl 

-j - (Ami) = 0) an d its asymptotic variance is computed from 
dn 



dn 2 



(it) 



(see [29]). This is essentially the method used 



by de la Iglesia et al. [18] in a similar case. 

Bivariate maximum likelihood estimation 

In cases where no null mutant counts have been observed, or if 
an estimate of the relative fitness is desired together with the 
mutation rate, another procedure must be used. Estimating the 
two parameters of a classical Luria-Delbruck distribution by the 
method of maximum likelihood was proposed long ago 
[8,12,31,32]. Using well known explicit formulas, the method 
has been implemented [11,14,33]. In [15] it was shown that 
similar algorithms apply not only to the classical Luria-Delbrtick 
distribution (in which division times are exponentially distributed), 
but also to the so-called Haldane model in which distribution times 
are supposed constant [34,35]. The situation here is only slightly 
different. Instead of being considered as a sample of a fixed Luria- 
Delbrtick distribution, mutant counts can be viewed as indepen- 
dent realizations of different distributions. Denote by LD(a,p) a 
Luria-Delbrtick distribution with expected number of mutations a. 
and relative fitness p. If a pair mutant count - final number (m,n) 
has been observed, m is viewed as a realization of the LD(nn,p), 
and the likelihood is computed accordingly. Thus the pair (n,p) is 



The computation of the gradient and Hessian of i(n,p) are only 
slightly different from those needed for the calculation of the 
maximum likelihood estimates of a and p in the classical case 
[12,14]. In the formulas below, be shall omit the dependence in 
(n,p) for clarity. The first and second derivatives of £ are evaluated 
at (n,p), those of q m are evaluated at (roz,-,p). The gradient is 
computed by: 



de _ ^ n i dq* 
i=i * 



dn 



da 



dp 



= ^q 

,'=1 * 



1 3q m . 
dp 



(5) 



The Hessian is computed by: 



d 2 e 


s n 2 

E — 


<5 tfmj 


_± 


1 dq mj 


dn 2 


i=i qm, 


da 2 


ql H 




d 2 e 


t n< 


d 2 q mi 




fdq m . 


dndp 


i= l Qm l 


dxdp 


ql, 


V 8<x . 


d 2 e 


s 

E 1 


d 2 q mi 




'SqmA 2 


dp 2 




Bp 2 




v ) 



dq mi 
~dp 



(6) 
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relative bias, log-normal final numbers 



o 
d 




0.0 0.2 0.4 0.6 0.8 1.0 

coefficient of variation 

Figure 2. Relative biases on estimates of a mutation rate. Relative biases are plotted as a function of the coefficient of variation C. The 
different curves correspond to 20 values of a=fm from 0.1 to 2. Red curves show biases of the po-method. For blue curves, the bias has been 
corrected by the unbiasing factor (1). The correction maintains the bias under acceptable values even for relatively large a and C. 
doi:1 0.1 371 /journal.pone.01 01 434.g002 

from (5) and (6). It is not the best method by far. We are presently 
working on an optimized implementation, to be included in a 
forthcoming R package. 

Model for simulations 

In the simulation study reported in the Results section, we have 
chosen to draw samples of final numbers according to a log- 
normal distribution with fixed expectation p. and coefficient of 
variation C. Other similarly shaped distributions could have been 
used: gamma, inverse Gaussian, Weibull, etc. Our choice of the 
log-normal was motivated by fitting real data, and by previously 
published results: see [16] and references therein. 

If some value of the mutation rate n has been fixed, and the final 
number of cells N has been simulated, a mutant count can be 
drawn according to a Luria-Delbriick distribution with expected 
number of mutations oc = nN and relative fitness p. As explained in 
[15], an additional choice must be made: that of a probability 
distribution for division times. Neither of the two extreme choices 
that leads to computable versions of the Luria-Delbriick distribu- 



The first and second derivatives of q,„(a,p) in a and p are 
obtained by recursive algorithms that will not be reproduced here 
[12,14]. 

It is a well known fact in statistics, that the most easy looking 
maximum likelihood problem usually conceals algorithmic diffi- 
culties: numeric instability, bad conditionning of the Hessian, etc. 
[36]. Here, the procedure looks straightforward from (5) and (6): 
solving the gradient by a quasi-Newton or conjugate gradient 
method should be done quite efficiently at low computing cost. 
However, depending on the values in the sample, some 
optimization techniques may be more efficient than others. For 
the results described in this article, we have used the statistical 
software R [22], and compared several optimization algorithms: 
quasi-Newton, BFGS, conjugate gradient, simulated annealing 
[37]. The calculation of the Hessian at the maximum likelihood 
solution, which is needed to output asymptotic variances poses a 
numerical problem, already signalled in [14]. For the results of the 
article a numeric evaluation of the Hessian was used instead of (6) 
[37]. In File SI, only the simplest method has been included: it 
consists in solving the gradient by the Raphson-Newton method, 
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tion (exponential and constant division times) is realistic. We have 
chosen the same distribution as in [15]: the best adjustment on 
Kelly and Rahn's observation on Bacterium aerogenes [38]. 

Simulations have been conducted for different sets of param- 
eters. Results are reported for the following values, considered as 
representative: 

ti=10- 9 , /(=10 9 , C = 0.5, p = l . 

One thousand samples of size 50 of pairs (mutant counts - final 
numbers) were simulated. For each sample, six estimates of n were 
computed, together with their theoretical standard deviation. 

• Classical methods: the estimate of the expected number of 
mutations a was computed by two different methods: the /Jo- 
method [1,5,28], and the maximum likelihood (ML) method 
[12,31,32], both applied to the sample of mutant counts. 
Dividing by the expected final number fi, assumed to be 
known, leads to two estimates for n. 

• unbiased estimates: to each of the two previous estimates, the 
unbiasing formulas (1) and (3) were applied, assuming that the 
true value of the coefficient of variation was known which lead 
to two more estimates of n. There again, the expected final 
number p was supposed to be known, as well as the coefficient 
of variation C. 

• po-method on the pairs: no prior information being assumed, the 
maximum likelihood determination of n by the p^-methoA was 
applied to the sample of pairs mutant counts - final numbers. 

• maximum likelihood for n and p: taking againg the sample of pairs 
with no prior information, a joint estimation for % was 
obtained. 
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